Section 1-1 - Filling-in Missing Values

In the previous section, we ended up with a smaller set of predictions because we chose to throw away rows with missing values. We build on this approach in this section by filling in the missing data with an educated guess.

We will only provide detailed descriptions on new concepts introduced.

Pandas - Extracting data


In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('../data/train.csv')

Pandas - Cleaning data


In [2]:
df = df.drop(['Name', 'Ticket', 'Cabin'], axis=1)

Similar to the previous section, we review the data type and value counts.


In [3]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 9 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(2)

There are a number of ways that we could fill in the NaN values of the column Age. For simplicity, we'll do so by taking the average, or mean, of values of each column. We'll review as to whether taking the median would be a better choice in a later section.


In [4]:
age_mean = df['Age'].mean()
df['Age'] = df['Age'].fillna(age_mean)

Exercise

  • Write the code to replace the NaN values by the median, instead of the mean.

Taking the average does not make sense for the column Embarked, as it is a categorical value. Instead, we shall replace the NaN values by the mode, or most frequently occurring value.


In [5]:
from scipy.stats import mode

mode_embarked = mode(df['Embarked'])[0][0]
df['Embarked'] = df['Embarked'].fillna(mode_embarked)

In [6]:
df['Gender'] = df['Sex'].map({'female': 0, 'male': 1}).astype(int)
df['Port'] = df['Embarked'].map({'C':1, 'S':2, 'Q':3}).astype(int)

df = df.drop(['Sex', 'Embarked'], axis=1)

cols = df.columns.tolist()
cols = [cols[1]] + cols[0:1] + cols[2:]
df = df[cols]

We now review details of our training data.


In [7]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 9 columns):
Survived       891 non-null int64
PassengerId    891 non-null int64
Pclass         891 non-null int64
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Gender         891 non-null int64
Port           891 non-null int64
dtypes: float64(2), int64(7)

Hence have we have preserved all the rows of our data set, and proceed to create a numerical array for Scikit-learn.


In [8]:
train_data = df.values

Scikit-learn - Training the model


In [9]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators = 100)
model = model.fit(train_data[0:,2:],train_data[0:,0])

Scikit-learn - Making predictions


In [10]:
df_test = pd.read_csv('../data/test.csv')

We now review what needs to be cleaned in the test data.


In [11]:
df_test.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)

In [12]:
df_test = df_test.drop(['Name', 'Ticket', 'Cabin'], axis=1)

As per our previous approach, we fill in the NaN values in the column Age with the mean.


In [13]:
df_test['Age'] = df_test['Age'].fillna(age_mean)

For the column Fare, however, it makes sense to fill in the NaN values with the mean by the column Pclass, or Passenger class.


In [14]:
fare_means = df.pivot_table('Fare', index='Pclass', aggfunc='mean')

In [15]:
fare_means


Out[15]:
Pclass
1         84.154687
2         20.662183
3         13.675550
Name: Fare, dtype: float64

Here we created a pivot table by calculating the mean of the column Fare by each Pclass, which we will use to fill in our NaN values.


In [16]:
df_test['Fare'] = df_test[['Fare', 'Pclass']].apply(lambda x:
                            fare_means[x['Pclass']] if pd.isnull(x['Fare'])
                            else x['Fare'], axis=1)


/Users/savarin/anaconda/envs/py27/lib/python2.7/site-packages/pandas/core/index.py:503: FutureWarning: scalar indexers for index type Int64Index should be integers and not floating point
  type(self).__name__),FutureWarning)

This is one of the more complicated lines of code we'll encounter, so let's unpack this.

First, we look at each of the pairs (Fare, Pclass) (i.e. lambda x). From this pair, we check if the Fare part is NaN (i.e. if pd.isnull(x['Fare'])). If Fare is NaN, we look at the Pclass value of that pair (i.e. x['PClass']), and replace the NaN value the mean fare of that class (i.e. fare_means[x['Pclass']]). If Fare is not NaN, then we keep it the same (i.e. else x['Fare']).


In [17]:
df_test['Gender'] = df_test['Sex'].map({'female': 0, 'male': 1}).astype(int)
df_test['Port'] = df_test['Embarked'].map({'C':1, 'S':2, 'Q':3})

df_test = df_test.drop(['Sex', 'Embarked'], axis=1)

test_data = df_test.values

output = model.predict(test_data[:,1:])

Pandas - Preparing for submission


In [18]:
result = np.c_[test_data[:,0].astype(int), output.astype(int)]
df_result = pd.DataFrame(result[:,0:2], columns=['PassengerId', 'Survived'])

df_result.to_csv('../results/titanic_1-1.csv', index=False)

Now we'll discover that our submission has 418 predictions, and can proceed to make our first leaderboard entry.


In [19]:
df_result.shape


Out[19]:
(418, 2)

Congratulations on making your first Kaggle submission!!